Abstract

A system for Operational Risk management based on the computational 
paradigm of Bayesian Networks is presented. The algorithm allows the con- 
struction of a Bayesian Network targeted for each bank using only internal 
loss data, and takes into account in a simple and realistic way the correla- 
tions among different processes of the bank. The internal losses are averaged 
over a variable time horizon, so that the correlations at different times are re- 
moved, while the correlations at the same time are kept: the averaged losses 
are thus suitable to perform the learning of the network topology and param- 
eters. The algorithm has been validated on synthetic time series. It should 
be stressed that the practical implementation of the proposed algorithm has 
a small impact on the organizational structure of a bank and requires an 
investment in human resources limited to the computational area. 
1. Introduction

In recent years a powerful set of tools to study complexity has been
developed by physicists and applied to economic and social systems; among
the several topics under investigation, the quantitative estimation and
management of several types of risk [1], such as financial risk and
operational risk, has recently emerged.

Operational Risk (OR) is defined as "the risk of [money] loss resulting
from inadequate or failed internal processes, people and systems or from
external events", including legal risk, but excluding strategic and
reputation-linked risks. Since it depends on a family of heterogeneous
causes, in the past only a few banks dealt with OR management. Starting
from 2005, the approval of "The New Basel Capital Accord" (Basel II) has
substantially changed this picture: OR is now considered a critical risk
factor and banks are required to cope with it by setting aside a certain
capital charge.

Basel II proposes three methods to determine this capital: i) the Basic
Indicator Approach (BIA) sets it to 15% of the bank's gross income; ii) the
STandardized Approach (STA or TSA) is a simple generalization of the BIA:
the percentage of the gross income is different for each Business Line (BL)
and varies between 12% and 18%; iii) the Advanced Measurement Approach
(AMA) allows each bank to use an internally developed procedure to estimate
the impact of OR. Both the BIA and the STA seem overly simplistic, since
they essentially assume that the exposure of a bank to operational losses is
proportional to its size. On the other hand, an AMA not only helps a bank to
set aside the correct capital charge, but may even enable OR management,
with the prospect of limiting the amount of future losses.

Each AMA has to take into account two types of historical operational
losses: the internal ones, collected by the bank itself, and the external ones,
which may belong to a database shared among several banks. Nevertheless,
since the interest in OR is recent, only small and not adequately accurate
historical databases exist, and this is why each AMA is also required to use
assessment data produced by experts. In addition, Basel II provides a
classification of operational losses into 8 BLs and 7 Loss Event Types (LETs)
which has to be shared by all the AMAs. Finally, AMAs usually identify the
capital charge with the 99.9% 1-year Value-at-Risk (VaR), i.e. the 99.9th
percentile of the yearly loss distribution.

Among the AMA methods, the most widely used is the Loss Distribution
Approach (LDA). In LDA the distribution of frequency and the distribution
of impact (severity) modeling the operational losses are studied separately
for each of the 56 pairs (BL, LET). LDA makes two crucial assumptions:
i) frequency and severity distributions are independent for each pair; ii) the
distributions of each pair are independent of the distributions of all the
other pairs. In other words, LDA neglects the correlations that may exist
between the frequency or the severity of the losses occurring in different pairs.

The idea of exploiting Bayesian Networks (BNs) to study OR has already
been proposed in the literature, and various approaches are possible. The
main advantages offered by BNs are two:

• the possible correlations among different bank processes can be cap- 
tured; 

• the information contained in both assessments and historical loss data
can be merged in a natural way.

One approach may be to design a completely different network for each 
bank process, trying to determine the relevant variables (in the context of 
each process) and the causal relationship among them; this kind of network 
has only one output node which typically represents the loss distribution for 
the process under investigation. This approach has several drawbacks: i) 
domain experts are needed for each process, in order to properly identify the 
variables and to define the topology of each network; ii) if the historical data 
needs to be used, a system monitoring all the included variables with an 
acceptable frequency and accuracy has to be built; since this kind of network
can easily reach large sizes (tens of variables), managing such systems is quite
challenging for a banking institution; iii) correlations across different processes
are not taken into account. 



Another approach is to design a single network composed of one node
for each process, representing its loss distribution; all nodes are output
nodes and the operational losses are sufficient to build a historical database,
so that collecting and managing the data is much easier for a bank; in
comparison with the previous approach even the experts' task becomes
simpler, since their assessment reduces to an estimate of the losses over
a certain time horizon; clearly this kind of network is specifically designed
for capturing the correlations among different processes. This approach
resembles a way of reasoning typical of the field of Complex Systems: all
the "microscopic" details inherent to each process (which form the basis on
which the first approach is built) are not included in the model, assuming
that they can be neglected to a certain extent.

Let us underline that, as regards the practical implementation inside a
bank, the difference between the two approaches is huge: in the first
approach tens of variables for each process need to be monitored, while in the
second approach only one variable per process (the registered losses) has to
be; considering that an AMA-oriented bank has to track its own internal
losses in any case, the cost of the proposed implementation is minimal.

Mixed approaches, in which a subnetwork of the kind used in the first
approach (but usually smaller) is nested into each node representing the loss
distribution of a process, are also possible.



2. Bayesian Networks 



In order to define a Bayesian Network [12] two elements are necessary:
a set of random variables $V = (X_1, X_2, \ldots, X_n)$ and a network of nodes
corresponding to the random variables in $V$. In particular the network must
be a Directed Acyclic Graph (DAG) and the joint Probability Distribution
Function (PDF) $P(X_1, X_2, \ldots, X_n)$ must satisfy the Markov condition, i.e.
each random variable $X_i$ and the set of all its non-descendants must be
conditionally independent, given the set of all its parents. It can be proved for
discrete variables (which turns out to be our case) that the Markov condition
allows the joint PDF to be calculated easily as:

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_i), \qquad (1)$$

where $\mathrm{Pa}_i$ is the set of random variables whose corresponding nodes are
parents of the node associated with $X_i$.
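
As an illustration of the factorization (1), the following minimal sketch computes the joint PDF of a small discrete network as a product of conditional probabilities. The three-node topology, the node names and all the numerical values in the tables are hypothetical and chosen only for the example; they are not taken from any dataset discussed here.

```python
from itertools import product

# Hypothetical DAG with binary states 0/1: X1 -> X2 and X1 -> X3.
parents = {"X1": (), "X2": ("X1",), "X3": ("X1",)}

# Conditional probability tables: cpt[node][parent_values][state].
# All numbers are illustrative assumptions.
cpt = {
    "X1": {(): [0.7, 0.3]},
    "X2": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
    "X3": {(0,): [0.8, 0.2], (1,): [0.5, 0.5]},
}


def joint_probability(assignment):
    """Product over the nodes of P(X_i | Pa_i), as in the factorization (1)."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p


def marginal(node, state):
    """Marginal probability obtained by summing the joint PDF over all other nodes."""
    names = list(parents)
    total = 0.0
    for values in product([0, 1], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node] == state:
            total += joint_probability(assignment)
    return total
```

Summing the joint PDF over the states of all the other variables, as `marginal` does, is the brute-force way to obtain the marginal distribution of a node; it is viable only for very small networks.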

Both the directed links appearing in the DAG and the values of the
conditional probabilities $P(X_i \mid \mathrm{Pa}_i)$ can be learned from a dataset whose records
hold the values assumed by each $X_i$ in independent experiments. Even if we
are not dealing here with the problem of a rigorous definition of what
independent experiments are, we will be more formal about this point because it
is the core of our implementation. Let us associate a random variable with each
node and each experiment: $X_i^{(p)}$ is the random variable associated with
the $i$-th node and with the $p$-th experiment. The $p$-th and the $q$-th
experiments ($p \neq q$) are said to be independent if $X_i^{(p)}$ and $X_j^{(q)}$ are
independent $\forall\, i$ and $j$.

3. Different-times correlations

One of the fundamental reasons to use BNs to estimate the OR is that, if
correlations do exist among different processes, they can be captured through
the network topology; however, the correlation can extend arbitrarily over
time, as an example will help to clarify. Suppose that an employee violates the
transaction control system for fraudulent purposes: he succeeds in his aim and
a monetary loss is generated in some process of the bank. As a side effect a
part of the IT infrastructure is damaged, but the failure is discovered and
repaired only a week later: a loss is generated in the process of machinery
servicing with a one-week lag. In the meantime the system remained partially
unavailable and a certain number of transactions failed, eventually generating
losses delayed up to a week in many other processes.

In order to understand the importance of this point we need to look at the
structure of a database of historical losses: each record holds the daily losses
classified by the process in which they occur. The example should have made
it clear that the losses registered on different days cannot be considered
as originating from independent experiments (as defined in Section 2), so a
database with this structure is in principle useless for learning purposes. To
overcome this limitation we propose a new approach: the losses are averaged
over a certain time interval $T$ such that the correlations of the averaged losses
vanish at different times, but are still present at the same time.

In such an approach the original database is replaced by a new database
(which will be called the extracted database) of averaged losses whose number
of records is $L/T$, where $L$ is the number of records in the original database.
Suppose e.g. that $T = 90$ is one of the time intervals we are looking for and $L$
corresponds to 1 year: this means that the average losses of a quarter are not
correlated with the average losses of another quarter, but the average losses
recorded by different processes in the same quarter are still correlated among
themselves; different quarters may be considered independent experiments,
thus the extracted database can be used for learning purposes.

4. Learning Bayesian Networks by aggregate losses 

In Section 3 the idea of averaging the losses over a certain time interval
is introduced. What we actually do is sum all the losses belonging to the
same process and the same time interval: the $k$-th record in the extracted
database contains the aggregate loss of the records from $(k-1)T + 1$ to
$kT$, retaining the process classification. Let us suppose again that
$T = 90$: the first (second, ...) record in the extracted database contains the
aggregate loss of the records from 1 to 90 (91 to 180, ...) in the original
database. Summing is equivalent to averaging but, as we are going to see,
makes much more sense in view of the VaR calculation.
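
The aggregation step can be sketched as follows. The function name `extract_database` and the layout of the input (a list of daily records, each holding one loss per process) are assumptions made for the example; in addition, a trailing block shorter than T is simply discarded, a choice that is not prescribed by the text.

```python
def extract_database(daily_losses, T):
    """Sum the daily losses of each process over consecutive blocks of T records.

    daily_losses: list of records, each a list with one loss per process.
    Returns a list of aggregate records, one per complete block of length T.
    """
    n_blocks = len(daily_losses) // T  # incomplete trailing block dropped (assumption)
    extracted = []
    for k in range(n_blocks):
        block = daily_losses[k * T:(k + 1) * T]
        # zip(*block) groups the losses by process, preserving the classification
        extracted.append([sum(process_losses) for process_losses in zip(*block)])
    return extracted
```

With 0-based indexing, block `k` covers the records from `k * T` to `(k + 1) * T - 1`, which corresponds to the records from (k-1)T + 1 to kT in the 1-based convention used in the text.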

After the new database has been extracted, we can start building the
network by defining the nodes and the allowed states of the associated variables:
we set the number of states $n$ to 5 for all the variables; the bins are equally
spaced, with zero as the lower limit and the maximum average loss of each
process as the upper limit.
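
A possible discretization of the aggregate losses of one process into equally spaced bins is sketched below. It assumes that the lower limit of the binning is zero and that states are labelled from 1 to n; the function name `discretize` is hypothetical.

```python
def discretize(values, n_states=5):
    """Map each loss to a state in 1..n_states using equally spaced bins.

    The bins span the interval from 0 (assumed lower limit) to max(values).
    """
    upper = max(values)
    width = upper / n_states
    states = []
    for v in values:
        k = int(v / width) if width > 0 else 0
        # the maximum value would fall just outside the last bin: clamp it
        states.append(min(k, n_states - 1) + 1)
    return states
```

Since the upper limit depends on the process, each node gets its own binning, and a state label has to be mapped back to a loss amount using the bin width of the corresponding process.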

The extracted database is then used to learn the structure of the network
and the conditional probabilities. As hinted in Section 1, another reason why
BNs seem to be suitable for OR estimation is that they allow the
information coming from the historical database to be integrated with the
information coming from experts' assessment. Topology constraints can be imposed
before the structure learning is performed, while prior knowledge can be
embedded by properly setting the marginal distributions of each variable before
the conditional probability learning is performed. However, we are mainly
interested in studying the correlations of the losses and thus we choose
neither to impose topology constraints, nor to embed any prior knowledge about
the marginal distributions of the variables.

The joint PDF can then be derived using (1) and the marginal PDF for
each variable calculated. We recall here that the database entries are
values assumed by the random variables associated with the nodes (see Section
2): if the database used for the learning procedure contains the cumulative
losses of a quarter (classified by process), the marginal PDFs obtained as the
output of the BN will be the loss distributions per quarter (classified by
process). Let us note that, provided that $T = 90$ is such that the different-times
correlations vanish, it is reasonable to consider the loss distributions relative
to different quarters to be independent. Making the further assumption that
the loss distributions per quarter are the same for each quarter, it is possible
to calculate the loss distributions over any time horizon, by numerically
convoluting the loss distributions over the time horizon $T$ an appropriate
number of times. Indeed, in order to compare the results obtained for
different values of $T$, we calculate the loss distributions and the VaR over a fixed
time horizon: for this purpose $L$ seems the most natural time horizon to fix.

Let $P_i^T$ be the loss distribution of the $i$-th process over the time horizon
$T$ and $P_i^T(k)$ the value of $P_i^T$ in the $k$-th bin; the convolution of $P_i^T$ by itself
is defined by:

$$(P_i^T * P_i^T)(k) = \sum_{m=\max(1,\,k+1-n)}^{\min(k,\,n)} P_i^T(m)\, P_i^T(k-m+1),$$

with $n = 5$ in our case. To obtain $P_i^L$ (i.e. the loss distribution of the $i$-th
process over the time horizon $L$) $P_i^T$ has to be convoluted by itself a number
of times:

$$P_i^L = \underbrace{P_i^T * P_i^T * \cdots * P_i^T}_{L/T \text{ times}}.$$
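
The discrete convolution defined above can be sketched in a few lines. The code uses 0-based lists, so that the entry of index k-1 corresponds to the k-th bin of the formula; the function names are hypothetical, and `n_periods` is interpreted here as the number of independent periods of length T being combined, which is an assumption about how the number of repetitions is counted.

```python
def convolve(p, q):
    """Discrete convolution of two binned distributions given as lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            # bins a+1 and b+1 (1-based) contribute to bin a+b+1 of the result
            out[a + b] += pa * qb
    return out


def loss_distribution_over_horizon(p_T, n_periods):
    """Combine n_periods independent, identically distributed periods."""
    result = list(p_T)
    for _ in range(n_periods - 1):
        result = convolve(result, p_T)
    return result
```

Each convolution with a distribution of n bins adds n - 1 bins to the result, so combining m periods yields m (n - 1) + 1 bins; the total probability remains equal to 1 up to rounding errors.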



The VaR over the time horizon $L$ for each process is the 99.9th percentile
of the convoluted loss distribution and the total VaR is simply the sum
of the VaRs of the single processes. The 99.9th percentile of the convoluted
distribution (for each process) can be determined numerically in the following
way: the convoluted distribution is sampled $10^3$ times and the sample is
arranged in increasing order; the second largest value is the 99.9th percentile
of the convoluted distribution. Since this procedure involves sampling, it is
repeated several times and the VaR is calculated as the mean of the obtained
99.9th percentiles.
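
The sampling procedure just described can be sketched as follows. The function name `var_999`, the argument `bin_values` (a representative loss amount for each bin), the default of 30 repetitions and the fixed seed are all assumptions introduced for the example; the text only states that the procedure is repeated several times.

```python
import random


def var_999(bin_values, probs, n_samples=1000, n_repeats=30, seed=0):
    """Estimate the 99.9th percentile of a binned loss distribution by sampling.

    Each repetition draws n_samples values, sorts them in increasing order
    and keeps the second largest one; the estimate is the mean over the
    repetitions. n_repeats and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        sample = sorted(rng.choices(bin_values, weights=probs, k=n_samples))
        estimates.append(sample[-2])  # second largest of the 10^3 sampled values
    return sum(estimates) / len(estimates)
```

Under the same assumptions, the total VaR would be obtained by calling `var_999` once per process and summing the results.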

As hinted before, the VaR may be calculated over any desired time
horizon by tuning the number of convolutions; in particular the time horizon
can be set to 1 year, as required by Basel II, by performing a number of
convolutions equal to the ratio between 1 year and $T$.

5. Synthetic Data 

In order to investigate our approach, we developed a reliable and tunable 
database of synthetic internal losses: in this way we are able to control the 
correlations between the different processes and some inherent features of 
each process. 

We consider the historical losses of each process as a time series and,
inspired by [13], generalize a stochastic algorithm for generating multiple
time series. We point out that this procedure allows one to impose, at least in
principle, arbitrary cross-correlation functions between each pair of generated
time series, as well as the auto-correlation function and distribution of each
generated time series.




The steps of the algorithm are the following: i) for each process, $L$
values are drawn from an arbitrary distribution; the order in which the values
are extracted is considered to be a temporal order, so let us call the
extracted values $l_i(s)$, where the subscript $i = 1, \ldots, N$ indexes the process and
the argument $s = 1, \ldots, L$ defines the temporal ordering. ii) The following
quantity is calculated:

$$\sum_{i,j=1}^{N} \sum_{t=1}^{L-1} \left[ C_{ij}(t) - c_{ij}(t) \right]^2, \qquad (2)$$

where $N$ is the number of processes, $C_{ij}$ are the imposed cross-correlation
(or auto-correlation) functions, while $c_{ij}$ are the cross-correlation (or
auto-correlation if $i = j$) functions calculated from the generated data:

$$c_{ij}(t) = \frac{1}{\mathrm{cov}(l_i, l_j)} \, \frac{1}{L-t} \sum_{s \le L-t} \left( l_i(s) - \langle l_i \rangle \right) \left( l_j(s+t) - \langle l_j \rangle \right), \qquad (3)$$

with $\langle l_i \rangle = \frac{1}{L} \sum_{s \le L} l_i(s)$ and $\mathrm{cov}(l_i, l_j) = \langle (l_i - \langle l_i \rangle)(l_j - \langle l_j \rangle) \rangle$. From (3)
it follows that $c_{ij}(0) = 1$: in other words, because of its normalization, $c_{ij}$
carries no information about the same-time correlations; in order to make the
whole procedure consistent $C_{ij}(0)$ must also be equal to 1: this explains why
the summation over $t$ in (2) starts from 1 and not from 0. iii) Two values
belonging to a randomly selected series are randomly chosen and exchanged,
and the quantity (2) is recalculated. iv) If (2) has decreased, the exchange
between the two values performed in step (iii) is accepted, otherwise
it is rejected. As $c_{ij}$ is not bounded, (2) cannot be normalized and thus
a threshold below which the algorithm is halted cannot be set. We rather
choose to iterate the algorithm until (2) reaches a plateau.
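
Steps (i) to (iv) can be sketched as below. This is only a minimal illustration under several assumptions: the lagged covariance is normalized by the same-time covariance, so that the correlation at lag zero equals 1; the sum over the lags is truncated at a parameter `max_lag` to keep the cost of each iteration low; and the loop runs for a fixed number of iterations `n_iter` instead of detecting a plateau. All function and parameter names are hypothetical.

```python
import math
import random


def mean(x):
    return sum(x) / len(x)


def cross_corr(li, lj, t):
    """Lagged covariance of two series, normalized so that the value at t = 0 is 1."""
    L = len(li)
    mi, mj = mean(li), mean(lj)
    cov0 = sum((a - mi) * (b - mj) for a, b in zip(li, lj)) / L
    lagged = sum((li[s] - mi) * (lj[s + t] - mj) for s in range(L - t)) / (L - t)
    return lagged / cov0


def cost(series, target, max_lag):
    """Squared distance between imposed and measured correlations, lags 1..max_lag."""
    n = len(series)
    return sum(
        (target(i, j, t) - cross_corr(series[i], series[j], t)) ** 2
        for i in range(n) for j in range(n) for t in range(1, max_lag + 1)
    )


def impose_correlations(series, target, n_iter, max_lag, seed=0):
    """Swap two values of a random series; keep the swap only if the cost decreases."""
    rng = random.Random(seed)
    series = [list(s) for s in series]  # work on a copy of the input
    current = cost(series, target, max_lag)
    history = [current]
    for _ in range(n_iter):
        s = rng.choice(series)
        a, b = rng.sample(range(len(s)), 2)
        s[a], s[b] = s[b], s[a]
        trial = cost(series, target, max_lag)
        if trial < current:
            current = trial          # accept the exchange
        else:
            s[a], s[b] = s[b], s[a]  # reject: undo the exchange
        history.append(current)
    return series, history
```

Since the exchanges only permute the values inside each series, the distribution of each series is left untouched, while `history` records the non-increasing sequence of costs and can be inspected to judge whether a plateau has been reached. A target such as `lambda i, j, t: math.exp(-t / 5.0)` is one possible choice of imposed correlation.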

Since we are interested in the change of the correlation between
different processes with respect to the time interval $T$ over which the losses are
averaged, we imposed auto-correlation and cross-correlation functions of the
form:

$$C_{ij}(t) = e^{-t/\tau_{ij}}, \qquad (4)$$

since such a choice implies that the different-times correlation
between the processes $i$ and $j$ should be significantly reduced by averaging over a
time interval $T \sim \tau_{ij}$.

Even though the algorithm allows one to impose both distributions and $C_{ij}$,
in practice a certain degree of incompatibility may exist between them: this
means that, even if (2) reaches a plateau, $c_{ij}$ and $C_{ij}$ are still significantly
different. In order to overcome this limitation the algorithm is slightly
modified in the following way: we generate series which are longer than
$L$, so that a larger basin of values that may fulfill the imposed constraints is
available; e.g. suppose that the values of the series are drawn from a uniform
distribution and that the imposed $C_{ij}$ have a higher degree of compatibility
with another distribution: a subset of values belonging to this distribution
will be selected by the algorithm. The modified algorithm obviously alters
the imposed distributions; however, we see no reason to impose strict
constraints on the distributions and, on the other hand, as we are interested
in studying the correlations between the processes, we need high accuracy in
reproducing the $C_{ij}$.

6. Results 

We investigate a sort of toy model whose number of processes is limited to
$N = 3$; this choice is the result of a trade-off between our need to consider
a system complex enough to have a reasonable number of correlated processes
and the convenience of using series long enough to be able to carry out the
average over time and still have a sufficient number of data to perform the
learning of the network. With $L = 5000$ it is possible to average over 240
steps and still have 20 patterns left for the learning.

The negative exponential distribution has been shown to be compatible with
(4) if the decay matrix is homogeneous, i.e. $\tau_{ij} = \tau$, $\forall\, i$ and $j$, and if $\tau$
is not too large. In the top panel of Fig. 1 both $c_{ij}$ and $C_{ij}$ are shown for
$\tau = 25$. In order to simulate different kinds of processes, the means of their
distributions have been set respectively to 100, 50 and 10. Using a larger basin
of values as described in Section 5, both the mean and the variance of the
distributions do not significantly change, but a heavier tail appears.

As shown in the bottom panel of Fig. 1, averaging over a time interval
$T$ leaves the form (4) unchanged with a new decay time equal to $\tau/T$. This
actually means that, at the cost of reducing the length of the time series,
averaging effectively removes the different-times correlations: in particular
when $T/\tau = 2.4$ ($T \simeq 60$) all the different-times correlations are reduced to 0.1
and for $T > 80$ they can be considered effectively extinguished.
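
The orders of magnitude quoted above can be checked with a one-line computation. The sketch assumes an exponential correlation whose decay time, measured in units of the averaged time step, is the original decay time divided by T, with an original decay time of 25 as in the example of this section; the function name is hypothetical.

```python
import math


def residual_correlation(T, tau, lag=1):
    """Correlation between averaged records `lag` steps apart, for decay time tau / T."""
    return math.exp(-lag * T / tau)
```

Under these assumptions `residual_correlation(60, 25)` is about 0.09 and `residual_correlation(80, 25)` is about 0.04, in line with the values quoted in the text.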

Since $C_{ij}$ carry no information about the same-time correlations (see
Section 5), in order to study them we look at the learned structure of the
networks: Tab. 1 shows that the number of links decreases as $T$ increases.
This is somewhat expected since, as $T$ increases, the size of the extracted
database shrinks and it becomes more and more difficult to learn from it.
However, for $T > 80$ the structure learning algorithm still detects the
presence of some links: since the different-times correlations are extinguished for
such a large $T$, they must be due to the surviving same-time correlations.

To evaluate the consistency of the whole procedure we require that, for 
values of T such that the different-times correlations can be neglected, the 
value of VaR does not depend on T. 

In Fig. 2 the values of VaR with respect to $T$ are represented; each point
is the mean over 30 realizations of the procedure described in Section 4 and
the standard deviations are also shown. Indeed, from Fig. 2 it can be seen
that for $T > 60$ the values of VaR are compatible among themselves. On
the other hand, for $T < 60$ the different-times correlations are still present
and the records belonging to the extracted database cannot be considered
independent; nevertheless the learning algorithm for BNs considers them to
be independent (see Section 2) and returns unreliable loss distributions: the
corresponding VaR values are consequently also unreliable.

7. Conclusion 

A novel approach, based on Bayesian Networks, has been proposed for 
the quantitative management of Operational Risk in the framework of The 
New Basel Capital Accord. The principal features of the proposed approach 
are the following: 1) the whole topology of the network is derived from 
data of operational losses; each node in the network corresponds to a bank 
process and the links between the nodes, which are drawn learning from data, 
model the causal relationships between the processes; this scheme seems 
more flexible than the classification in 56 pairs (BL, LET) prescribed by 
Basel II and has the advantage of representing both the units that generate 
operational losses and the relationships between them. 2) For the first time
a Bayesian Network is used to represent the influence between correlated
operational losses that take place on different days, exploiting a dataset whose
records represent losses that occurred over $T$ days: using such a dataset the nodes
in the network represent the aggregate loss over $T$ and the VaR over a time
horizon $T$ can be computed. The extension to the VaR over the time horizon
$L$ requires an additional assumption (see Section 4) and is performed by
convoluting the probability density functions $L/T$ times and extracting the
99.9th percentile of the convoluted distribution.

Figure 1: Top panel: imposed cross-correlation function $C_{12}$ (dashed line) and obtained
cross-correlation function $c_{12}$ (solid line), with no averaging over time; for the sake of
readability only the first 1000 values are shown. Bottom panel: imposed cross-correlation function
$C_{12}$ with scaled decay time $\tau/T$ (dashed line) and obtained cross-correlation function $c_{12}$
(solid line) averaged over a time interval $T = 25$; for the sake of readability only the first
40 values are shown.

Figure 2: VaR with respect to the time interval $T$ over which the average of the losses
is performed; each point is the mean over 30 realizations of the procedure described in
Section 4 and the error bars span one standard deviation. For $T > 60$ the values of
VaR are compatible among themselves. For $T < 60$ the values are not reliable because
the records in the extracted database cannot be considered independent.

Table 1: The topology of BNs as the time interval $T$ over which the losses are averaged
varies. The links are more difficult to detect as $T$ increases because the size of the extracted
database used for the structure learning shrinks. For $T \simeq 60$ the different-times
correlations are reduced to 0.1, while for $T > 80$ they are extinguished and only the same-time
correlations remain.